Replacement Sheet

Create sequences of tokens and structure metrics to form program structure profiles.
Compare structure profiles to find similar code structures.
Compare token sequences within matching source code structures using a variant of the Longest Common Subsequence (LCS) algorithm.
Indices H, HT

Figure 1 Prior Art
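
The token comparison step in Figure 1 can be illustrated with a minimal sketch. It uses the textbook Longest Common Subsequence computation, not the particular variant the figure refers to, which is not specified here. The token names and the `similarity` score are hypothetical and are not the indices H and HT named in the figure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    # prev[j] holds the LCS length of the tokens of a seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical score: LCS length relative to the longer sequence."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Hypothetical token sequences: two statements are swapped in the second one.
tokens_a = ["if", "assign", "while", "call", "return"]
tokens_b = ["if", "while", "assign", "call", "return"]
print(lcs_length(tokens_a, tokens_b))  # 4
print(similarity(tokens_a, tokens_b))  # 0.8
```

Because a subsequence need not be contiguous, an inserted or swapped statement lowers the score only slightly, so lightly edited copies still score high.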

Source code files

Phase 1

Remove comments and string constants.
Translate upper-case letters to lower-case.
Map synonyms to a common form.
Reorder the functions into their calling order.
Remove all tokens that are not specific programming language keywords.
Token file pairs

Phase 2

Compare pairs of token files.
Matching pairs

Figure 2 Prior Art
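
Most of the Phase 1 steps in Figure 2 can be sketched for a C-like language as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the `KEYWORDS` set and the `SYNONYMS` table are hypothetical, character literals are not handled, and the step that reorders functions into their calling order is omitted because it needs a parser.

```python
import re

# Hypothetical keyword set and synonym table (assumptions for illustration).
KEYWORDS = {"if", "else", "while", "return", "switch", "case", "break"}
SYNONYMS = {"for": "while"}

# One pass over the text so a "//" inside a string is not taken for a comment.
STRIP = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*', re.DOTALL)


def tokenize(source):
    """Reduce source text to a list of normalized keyword tokens."""
    # Remove comments and string constants.
    source = STRIP.sub(" ", source)
    # Translate upper-case letters to lower-case.
    source = source.lower()
    # Split into identifier-like words.
    words = re.findall(r"[a-z_][a-z0-9_]*", source)
    # Map synonyms to a common form.
    words = [SYNONYMS.get(w, w) for w in words]
    # Remove all tokens that are not keywords.
    return [w for w in words if w in KEYWORDS]


sample = '''
/* compare */
IF (x > 0) { printf("while // not code"); }  // return here
for (i = 0; i < n; i++) { total += i; }
RETURN total;
'''
print(tokenize(sample))  # ['if', 'while', 'return']
```

The resulting token lists are what Phase 2 would compare pairwise.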

Phase 1

Source code files
Remove whitespace, comments, and identifier names.
Replace remaining language statements by tokens.
Token sequences

Phase 2

Compare token sequences using the Greedy String Tiling algorithm.
Matching pairs

Figure 3 Prior Art
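
The Phase 2 comparison in Figure 3 can be sketched with a simplified, quadratic-time form of Greedy String Tiling. It repeatedly takes the longest matches between tokens not yet covered and marks them as tiles. Production tools typically use a faster variant. The `min_match` parameter, the `coverage` score, and the token names below are assumptions for illustration.

```python
def greedy_string_tiling(a, b, min_match=2):
    """Return tiles (start_a, start_b, length) of matching unmarked tokens."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find all maximal matches of unmarked tokens of the greatest length.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len, matches = k, [(i, j, k)]
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip a match that overlaps a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
    return tiles


def coverage(a, b, min_match=2):
    """Hypothetical score: share of all tokens that lie inside some tile."""
    matched = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    total = len(a) + len(b)
    return 2 * matched / total if total else 1.0


# Hypothetical statement tokens: the second sequence swaps two blocks.
seq_a = ["DECL", "ASSIGN", "IF", "LOOP", "CALL", "RETURN"]
seq_b = ["LOOP", "CALL", "RETURN", "PRINT", "DECL", "ASSIGN", "IF"]
print(greedy_string_tiling(seq_a, seq_b))  # [(0, 4, 3), (3, 0, 3)]
```

Unlike a subsequence comparison, tiling still finds both blocks after they have been reordered, which is why it tolerates moved code.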

Remove whitespace and punctuation from file and convert all characters to lower case.
Divide the remaining non-whitespace characters of each file into k-grams.
Hash each k-gram and select a subset of all k-grams to be the fingerprints of the document.
Compare document fingerprints.
Matching pairs

Figure 4 Prior Art
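
The steps of Figure 4 can be sketched end to end as below. Several choices are assumptions made for illustration, since the figure does not fix them: CRC-32 as the hash function, "hash is 0 mod p" as the rule for selecting the subset, and the ratio of shared to total fingerprints as the comparison.

```python
import string
import zlib


def normalize(text):
    """Remove whitespace and punctuation and convert to lower case."""
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(ch for ch in text.lower() if ch not in drop)


def kgrams(text, k):
    """Every run of k consecutive characters, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def fingerprints(text, k=5, p=4):
    """Hash each k-gram and keep the hashes that are 0 mod p (assumed rule)."""
    hashes = [zlib.crc32(g.encode("utf-8")) for g in kgrams(normalize(text), k)]
    return {h for h in hashes if h % p == 0}


def resemblance(text_a, text_b, k=5, p=4):
    """Hypothetical comparison: shared fingerprints over all fingerprints."""
    fa = fingerprints(text_a, k, p)
    fb = fingerprints(text_b, k, p)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


doc_a = "She loves you yeah, yeah, yeah,"
doc_b = "she LOVES you: yeah yeah yeah"
print(fingerprints(doc_a) == fingerprints(doc_b))
```

Keeping only a subset of the hashes reduces what must be stored and compared per document, at the cost of possibly missing short overlaps.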



She loves you yeah, yeah, yeah,
(a) Some text

shelo helov elove loves ovesy vesyo esyou syouy youye
ouyea uyeah yeahy eahye ahyea hyeah yeahy eahye ahyea
hyeah
(b) The sequence of 5-grams derived from the text.

77 72 42 17 98 50 23 55 6 66 34 24 39 11 84 24 39 11 84
(c) A hypothetical sequence of hashes of the 5-grams.

72 24 84 24 84
(d) The fingerprints - selecting only those hashes that are 0 mod 4.

Figure 5 Prior Art
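
The worked example in Figure 5 can be checked with a few lines of code. The hash values are copied from panel (c) as given, since the figure calls them hypothetical and names no hash function. The variable names are illustrative.

```python
text = "She loves you yeah, yeah, yeah,"

# (a) -> (b): keep letters only, lower-cased, then take every 5-character window.
cleaned = "".join(ch for ch in text.lower() if ch.isalpha())
five_grams = [cleaned[i:i + 5] for i in range(len(cleaned) - 4)]

# (c): the hypothetical hashes exactly as listed in the figure, one per 5-gram.
hashes = [77, 72, 42, 17, 98, 50, 23, 55, 6, 66, 34, 24, 39, 11, 84, 24, 39, 11, 84]

# (d): keep only the hashes that are 0 mod 4.
selected = [h for h in hashes if h % 4 == 0]

print(len(five_grams), five_grams[:3])  # 19 ['shelo', 'helov', 'elove']
print(selected)                         # [72, 24, 84, 24, 84]
```

The 23 remaining characters yield 19 overlapping 5-grams, one per listed hash, and 5 of the 19 hashes survive the selection.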